Controlling computer storage systems

ABSTRACT

Goal-based availability and change management are handled over groups of heterogeneous storage controllers. Probabilistic and deterministic methods are employed to determine the allocation and placement of storage volumes to storage controllers, as well as the degree of data redundancy necessary to achieve data availability goals. The probabilistic methods can take into account past observations of controller availability, and operator beliefs, as well as the state of storage controller configuration, in coming up with a probabilistic estimate of future availability.

FIELD OF THE INVENTION

The present invention generally relates to information technology and, more particularly, to controlling computer storage systems.

BACKGROUND OF THE INVENTION

The need for scaling the capacity, availability, and performance of datasets across multiple direct-access storage devices (DASDs) led to the development of the Redundant Array of Inexpensive (or Independent) Disks (RAID) technology in the late 1980s, and the implementation of storage controllers that offer RAID-based logical disk abstractions. These storage controllers are typically computer servers attached to a large number of DASDs via a peripheral I/O interconnect. They form RAID arrays by combining groups of DASDs and subsequently create and export logical disk abstractions over these RAID arrays. The RAID technology protects against data loss due to DASD failure by replicating data across multiple DASDs and by transparently reconstructing lost data onto spare DASDs in case of failure. Depending on the degree of overall storage controller availability desired (which directly affects cost), storage vendors have several options regarding the reliability and redundancy of components used when designing storage controllers. Besides the reliability of hardware components, the quality of the software that implements failure recovery actions is important to the overall availability level provided by a storage controller.

The RAID technology is one of many approaches to using data redundancy to improve the availability, and potentially the performance, of stored data sets. Data redundancy can take multiple forms. Depending on the level of abstraction in the implementation, one can distinguish between block-level redundancy and volume-level replication. Block-level redundancy can be performed using techniques such as block mirroring (RAID Level 1), parity-based protection (RAID Level 5), or erasure coding. See R. Bhagwan et al., “Total Recall: System Support for Automated Availability Management,” in Proc. of USENIX Conference on Networked Systems Design and Implementation '04, San Francisco, Calif., March 2004.

Block-level redundancy operates below the storage volume abstraction and is thus transparent to system software layered over that abstraction. In contrast, volume-level replication, which involves maintaining one or more exact replicas of a storage volume, is visible to (and thus must be managed by) system software layered over the storage volume abstraction. Known technologies to perform volume-level replication include, e.g., FlashCopy® computer hardware and software for data warehousing, for use in the field of mass data storage, from International Business Machines Corporation, and Peer-to-Peer Remote Copy (PPRC).

Manual availability management in large data centers can be error prone and expensive and is thus not a practical solution. RAID systems (see D. Patterson et al., “A Case for Redundant Arrays of Inexpensive Disks (RAID),” Proceedings of ACM SIGMOD, Chicago, June 1988), which employ data redundancy to offer increased availability levels over groups of DASDs, operate in a mostly reactive manner and are typically not goal-oriented. Also, they may not easily extend from single controllers to systems of multiple storage controllers. The Change Management with Planning and Scheduling (CHAMPS) system, described in A. Keller et al., “The CHAMPS System: Change Management with Planning and Scheduling,” IBM Technical Report 22882, Aug. 25, 2003, is concerned with how a given change (e.g., a software upgrade of a component) in a distributed system affects other system components and with how to efficiently execute such a change by taking advantage of opportunities for parallelism. CHAMPS tracks component dependencies and exploits parallelism in the task graph. While representing a substantial advance in the art, CHAMPS may have limitations regarding consideration of service availability and regarding data availability in distributed storage systems.

There is little prior work on automated availability management systems in environments involving multiple, heterogeneous storage controllers. The Hierarchical RAID (HiRAID) system (see S. H. Baek et al., “Reliability and Performance of Hierarchical RAID with Multiple Controllers,” in Proc. 20th ACM Symposium on Principles of Distributed Computing (PODC 2001), August 2001) proposes layering a RAID abstraction over RAID controllers, and handling change simply by masking failures using RAID techniques. HiRAID may not be optimally goal-oriented and may focus on DASD failures only (i.e., as if DASDs attached to all storage controllers were part of a single DASD pool). It may not take into account the additional complexity and heterogeneity of the storage controllers themselves and thus may not be appropriate in some circumstances.

Other approaches may also inadequately characterize storage controller availability. For example, Total Recall (see R. Bhagwan et al., “Total Recall: System Support for Automated Availability Management,” in Proc. of USENIX Conference on Networked Systems Design and Implementation '04, San Francisco, Calif., March 2004) characterizes peer-to-peer storage node availability simply based on past behavior and treats all nodes as identical in terms of their availability profiles; it is thus more appropriate for Internet environments, which are characterized by simple storage nodes (e.g., desktop PCs) and large “churn,” i.e., large numbers of nodes going out of service and returning to service at any time, rather than enterprise environments and generally heterogeneous storage controllers. Another related approach applies Decision Analysis theory to the design of archival repositories. See A. Crespo and H. Garcia-Molina, “Cost-Driven Design for Archival Repositories,” Proceedings of the 1st ACM/IEEE-CS Joint Conference on Digital Libraries, Roanoke, Va., 2001. This is a simulation-based design framework for evaluating alternatives among a number of possible configurations and choosing the best alternative in terms of reliability and cost. Prior work within this framework, however, has not addressed the heterogeneity and complexity issues in large scale storage systems or the problem of storage volume placement on a set of storage controllers.

Existing provisioning systems, such as IBM's Volume Performance Advisor (VPA), primarily take into account capacity and performance considerations when recommending volume allocations. While VPA represented a substantial advance in the art, it may not have appropriate provision for availability goals.

It would thus be desirable to overcome the limitations in previous approaches.

SUMMARY OF THE INVENTION

Principles of the present invention provide techniques for controlling a computer storage system. In one aspect, an exemplary method includes the steps of obtaining deterministic component availability information pertaining to the system, obtaining probabilistic component availability information pertaining to the system, and checking for violation of availability goals based on both the deterministic component availability information and the probabilistic component availability information.

In another aspect, an exemplary method includes the steps of obtaining a request for change, obtaining an estimated replication time associated with a replication to accommodate the change, and taking the estimated replication time into account in evaluating the request for change. The methods can be computer-implemented. The methods can advantageously be combined.

One or more embodiments of the invention can be implemented in the form of a computer product including a computer usable medium with computer usable program code for performing the method steps indicated. Furthermore, one or more embodiments of the invention can be implemented in the form of an apparatus including a memory and at least one processor that is coupled to the memory and operative to perform exemplary method steps.

One or more embodiments of the invention may provide one or more beneficial technical effects, such as, for example, automatic management of availability and performance goals in enterprise data centers in the face of standard maintenance and/or failure events, automatic management of storage consolidation and migration activities, which are standard parts of IT infrastructure lifecycle management, and the like.

These and other objects, features and advantages of the present invention will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a flowchart showing exemplary method steps according to an aspect of the invention;

FIG. 2 shows an example of initial volume placement according to probabilistic information;

FIG. 3 shows an example of volume replication for availability according to probabilistic information;

FIG. 4 shows an example of initial volume placement according to deterministic information;

FIG. 5 shows an example of volume replication for availability according to deterministic information; and

FIG. 6 depicts a computer system that may be useful in implementing one or more aspects and/or elements of the present invention.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

FIG. 1 shows a flowchart 100 with exemplary method steps for controlling a computer storage system, according to an aspect of the invention. The method can be computer-implemented. Step 102 includes obtaining deterministic component availability information pertaining to the system, for example, a calendar time t and a duration Δt associated with a request for change (RFC). The request for change RFC(t, Δt) can be added to a downtime schedule. Step 104 includes obtaining probabilistic component availability information pertaining to the system, e.g., an estimated controller failure probability. In decision block 106, a check is made for violation of availability goals, based on both the deterministic component availability information and the probabilistic component availability information.

As shown at the “NO” branch of block 106, an additional step includes maintaining the current status, responsive to the block 106 checking indicating no violation of the availability goals. In the case of the “YES” branch of block 106, additional step 108 includes determining replication parameters, responsive to the block 106 checking indicating a violation of the availability goals. The replication parameters can include at least how to replicate and where to replicate.

As noted, obtaining deterministic component availability information pertaining to the system can include obtaining a request for change. Step 110 can include obtaining an estimated replication time. Decision block 112 can be employed to take the estimated replication time into account in evaluating the request for change. Specifically, in block 112, it can be determined whether sufficient time is available to replicate to accommodate the request for change. At block 114, responsive to said determining step indicating that sufficient time is not available, the request for change can be rejected, and/or an alternative plan to accommodate the request for change can be searched for.

At block 116, responsive to the determining step 112 indicating that sufficient time is indeed available, a change plan can be developed. The change plan can include, e.g., one or more of: (i) preparation information indicative of replica locations, relationships, and timing; (ii) execution information indicative of replication performance; (iii) failover detection information indicative of how to execute necessary failover actions no later than a time of an action associated with the request for change; (iv) completion information indicative of replication relationship maintenance and discard; and (v) information indicative of how to execute necessary failback actions no earlier than a time of another action associated with the request for change.

For example, with regard to (iii), the plan can provide the details of how to execute the necessary failover actions prior to or at the time of a failure or maintenance action, i.e., the switch over from the original storage volumes to their replicas on functioning storage controllers. With regard to (v), the plan can provide the details of how to execute the necessary failback actions at the time of or after recovery or completion of a maintenance action, i.e., the switch over from replicas to the original volumes.

The concept of taking replication time into account can be implemented separately from, or together with, the concept of using both probabilistic and deterministic information. Thus, in one or more exemplary embodiments, an inventive method could include steps of obtaining a request for change as at block 102, obtaining an estimated replication time associated with a replication to accommodate the change, as at block 110, and taking the estimated replication time into account in evaluating the request for change. The latter can include, e.g., one or more of steps 112, 114, and 116. In this invention, we focus on datasets whose requirements (in terms of capacity, performance, or availability) exceed the capabilities of individual storage controllers and thus must be spread over multiple storage controllers.

It will be appreciated that in storage systems that comprise multiple storage controllers (an architecture often referred to as “Scale-Out”), one faces the problem of maintaining desired dataset availability levels in the face of storage controller downtime. Downtime can be caused either by scheduled maintenance actions or by unexpected failure of one or more storage controllers. One reason that the benefits of the RAID technology cannot simply extend from groups of multiple DASDs to groups of multiple storage controllers is that storage controllers are significantly more complicated devices than individual DASDs and in general tend to exhibit downtime more frequently and for a wider variety of reasons (besides component failure) compared to DASDs; in addition to the complexity of individual storage controllers, groups of storage controllers in data centers are typically more heterogeneous than groups of DASDs inside RAID arrays. Given this degree of complexity and heterogeneity, the problem of deciding the right amount of data replication (how many data replicas to create and on which storage controllers to place them) for a given dataset, as well as how to react to storage controller unavailability, can be effectively addressed by one or more inventive embodiments in a process that takes these factors (i.e., storage controller complexity and heterogeneity) into account.

One or more embodiments of the invention may offer a proactive solution to maintaining the availability levels of datasets by dynamically and continuously determining the availability of individual storage controllers using a combination of statistical (probabilistic) and deterministic methods. Given such availability characterization of individual controllers, one or more exemplary inventive methods can periodically analyze the impact of probable or anticipated changes and come up with a change plan to maintain the availability goals of datasets. This can typically be accomplished without conflicting with existing reactive high-availability systems, such as RAID; in fact, one or more inventive embodiments can co-exist with and leverage these systems, which typically operate within individual storage controllers.

The probabilistic methods used in one or more embodiments of the invention can take into account past observations of controller availability (e.g., how many and what type of unavailability intervals each controller has undergone in the past), operator beliefs (e.g., an operator believes that a controller is vulnerable during a probation period immediately after it has undergone a firmware upgrade), as well as the state of storage controller configuration (e.g., how many standby and/or hot spare DASDs are currently available to mask an active-DASD failure; how many redundant storage controller system-boards and internal data-paths between controller system-boards and DASD arrays are in place) in coming up with a probabilistic estimate of future availability.

The deterministic methods employed in one or more embodiments of the invention take into account exact information about forthcoming changes, such as scheduled storage controller maintenance actions, which can be submitted by system operators and/or administrators via the aforementioned RFC.

One or more embodiments of the invention can combine controller availability measures estimated by the deterministic and probabilistic methods and come up with a volume placement plan (i.e., how many replicas to create and on which controllers to place them) and a change management plan (i.e., what type of failover and failback actions to invoke as a response to controllers going out of service or returning to service).

Still with reference to FIG. 1, it will be appreciated that the “Decide how/where to replicate” block 108 and “Output Change Plan” block 116 are of significance for the system administrator and/or manager. The “Add RFC” block 102 and the “Estimate controller failure probability” block 104 can be thought of as triggers. One or more inventive embodiments can come up with a placement and a change plan which are feasible.

In an exemplary embodiment, we can assume that a dataset is implemented as a collection of storage volumes VG = {v₁, v₂, . . . , v_(n)}, spread over multiple storage controllers. A dataset can potentially be accessible to one or more host servers and used by applications installed on these servers.

The desirable availability level of a dataset can be expressed as the ratio of the expected “uptime” (i.e., the time that the dataset is, or is expected to be, accessible to applications on host servers) over the total amount of time considered. The dataset is considered unavailable (i.e., inaccessible to applications) if at least one storage volume in the collection VG is unavailable. For example, if T is the present time, t is a future time, and Δt = t − T is the future time interval considered (e.g., a day, a month, or a year), during which the dataset is unavailable for time Δt_(outage), then the availability of a data set is defined as:

Availability = (Δt − Δt_(outage))/Δt  (1)

For the purpose of this description, dataset availability is expressed as a percentage; for example, availability of 99.99% or 0.9999 (otherwise referred to as “four 9s”) over a period of a month means that the maximum tolerated downtime cannot exceed about five minutes. The outage in the above formula can be caused by downtime of storage controllers, which may or may not have been anticipated. Anticipated downtime can be caused by scheduled maintenance operations. Unanticipated downtime is typically caused by failures of hardware or software components or by operator errors. One or more inventive methods can rely on the continuous characterization of the availability of individual storage controllers, based on deterministic and probabilistic calculations.
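By way of illustration only, the downtime budget implied by equation (1) can be computed as in the following minimal Python sketch (the function name and the sample figures are illustrative assumptions, not part of the invention):

    from datetime import timedelta

    def max_outage(availability: float, period: timedelta) -> timedelta:
        """Maximum tolerated downtime over `period` for a given
        availability goal, per equation (1)."""
        return timedelta(seconds=(1.0 - availability) * period.total_seconds())

    # "Four 9s" over a 30-day month tolerates roughly 4.3 minutes of outage.
    print(max_outage(0.9999, timedelta(days=30)))  # 0:04:19.200000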

Deterministic Storage Controller Availability Estimate

In deterministic calculations, one or more embodiments of the invention use exact information about future downtime at time t_(i) and for a duration Δt_(i) to calculate the operational availability (represented by the symbol A_(d)) of a dataset based on estimates of the mean-time-between-maintenance (MTBM) and the mean-downtime (MDT) measures, as follows:

A_(d) = MTBM/(MTBM + MDT).  (2)

Probabilistic Storage Controller Availability Estimate

In probabilistic calculations, one or more embodiments of the invention combine information such as:

a. Statistical characterization of the past behavior of controllers (i.e., using a controller's past uptime as an indication of its future expected uptime). This estimate will henceforth be represented by the symbol μ_(α).

b. The availability characterization of a storage controller as a function of its current configuration. This characterization can involve a variety of factors, such as: (i) the ability of the current configuration of the storage controller to sustain a number of faults (e.g., how many spare DASDs are currently in place and available and/or tested to mask an active-DASD failure); (ii) the availability behavior of the controller software (e.g., what is the degraded performance behavior of the controller when handling a failover action); (iii) the number of operational redundant data-paths within the controller; and (iv) the number of operational redundant data-paths between the controller and the host(s) accessing data on that controller. In one or more embodiments of the invention we encapsulate our belief in a controller's availability as a function of its operating configuration as a Bayesian probability. This estimate will henceforth be represented by the symbol μ_(β).

c. Operator belief regarding a controller's future availability, taking into account empirical rules such as the “infant mortality” effect, according to which a controller is considered to be particularly vulnerable to a failure in the time period immediately after a software upgrade. Similar to (b), this is also a Bayesian probability. This estimate will henceforth be represented by the symbol μ_(γ). The skilled artisan is familiar with the concept of infant mortality from, for example, J. Gray, “Why Do Computers Stop and What Can We Do About It?,” in Proceedings of the 6th International Conference on Reliability and Distributed Databases, June 1987.

The probability μ that a controller will be available in the future can be estimated from (a)-(c) above. This estimate of controller availability can be used in probabilistic calculations to derive a probabilistic estimate for the availability of an entire data set.

Statistical/probabilistic and deterministic information as described above can be used to estimate the degree of availability of a storage controller. There are multiple options regarding how to combine these sources of information. By way of example, one option is to take the minimum estimate among the deterministic estimate and (a)-(c).

Controller Availability = min(μ_(α), μ_(β), μ_(γ), A_(d)).  (3)
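A minimal Python sketch of equations (2) and (3) follows (function names and the sample figures are assumptions for illustration only, not part of the invention):

    def deterministic_availability(mtbm_hours: float, mdt_hours: float) -> float:
        """Operational availability A_d = MTBM / (MTBM + MDT), per equation (2)."""
        return mtbm_hours / (mtbm_hours + mdt_hours)

    def controller_availability(mu_alpha: float, mu_beta: float,
                                mu_gamma: float, a_d: float) -> float:
        """Bind the controller estimate to the strictest source, per equation (3)."""
        return min(mu_alpha, mu_beta, mu_gamma, a_d)

    # Example: past-behavior estimate 0.9995, configuration belief 0.999,
    # operator belief 0.9999, and a 4-hour maintenance window every 2,000 hours.
    a_d = deterministic_availability(2000.0, 4.0)          # ~0.998
    print(controller_availability(0.9995, 0.999, 0.9999, a_d))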

Binding the estimate of controller availability to the strictest estimate available, as expressed in equation (3) above, is expected to work well.

Storage Volume Allocation and Placement Algorithms

In what follows, exemplary inventive techniques for placing a set of storage volumes on a set of storage controllers in order to achieve a certain availability goal are presented, based on volume-level replication (i.e., each volume potentially being continuously replicated, or mirrored, to one or more other storage volumes on zero or more other storage controllers). According to this embodiment, for each volume v_(i) in a dataset that comprises a set of volumes VG (i.e., the set of primary volumes), v_(i) may be replicated one or more times to volumes v_(i1), v_(i2), etc., which are members of a replica-set VG′.

VG = {v₁, v₂, . . . , v_(n)}; VG′ = {v₁₁, v₁₂, . . . , v₂₁, v₂₂, . . . , v_(n1), v_(n2), . . . }.  (4)

Note that even though a volume may be replicated one or more times, outage (i.e., data inaccessibility) is still possible on the dataset when the storage controller that contains the primary volume fails or is taken out of service. This outage is unavoidable in most cases and is due to the amount of time it takes to failover to the secondary storage volume replica. This time depends on the replication technology and the particular storage infrastructure used.

One problem that can be addressed in this embodiment is the determination of the number of storage volumes (i.e., the number of primary volumes and replicas), as well as their placement on storage controllers, to achieve the set availability goals. A two-phase approach can be used: (i) in the first phase, the initial placement of volumes is decided based on capacity and performance goals only, producing the set VG and the mapping between volumes in VG and storage controllers, and (ii) in the second phase, the storage volumes are replicated as necessary to achieve the data availability goals. This phase results in the set VG′ as well as the mapping between volume replicas and storage controllers.

Initial Placement Phase

The first (initial placement) phase can be performed purely based on capacity and performance considerations and using known methods, such as the well-known and aforementioned IBM Volume Performance Advisor. Such placement of volumes to storage controllers, however, may not fully satisfy the availability goals for the dataset, which is why a second (data replication) phase may be necessary.

Data Replication Phase

Following the initial placement phase, data replication can be used to achieve the availability goals. This embodiment determines the degree of data replication necessary to achieve the availability goals (e.g., how many replicas of a volume are needed) as well as the placement (e.g., which storage controller to place a volume replica on). In addition, an implementation plan for executing these changes is presented.

One principle in this phase is to progressively improve the overall availability of a dataset by iteratively replicating storage volumes across storage controllers until the availability goal is reached. The process starts by calculating the initial (baseline) availability of the dataset VG without any replication. The availability can then be improved by selecting a storage volume from a storage controller with a low degree of availability (preferably, the lowest among any controllers with volumes in VG) and deciding on which controller to replicate this volume to increase the overall dataset availability. The availability can further be improved by replicating other volumes or by replicating certain volumes more than once. By iteratively growing the set VG′ (by selecting a volume in VG and replicating it on some other controller), one can monotonically improve the availability of the dataset until the availability goal is eventually reached.
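For concreteness, the following Python sketch illustrates this iterative phase under simplifying assumptions (independent controller failures, an exact availability computation by state enumeration, and a simple volume-selection heuristic); it is purely illustrative and not the invention's prescribed algorithm:

    import itertools

    def dataset_availability(placement, availability):
        """Availability of a dataset, computed exactly by enumerating
        controller up/down states. `placement` maps each volume to the set
        of controllers holding a replica of it; `availability` maps each
        controller to its availability estimate. Controllers are assumed
        to fail independently; the dataset is up in a state iff every
        volume has a replica on at least one up controller."""
        controllers = sorted({c for cs in placement.values() for c in cs})
        total = 0.0
        for states in itertools.product((True, False), repeat=len(controllers)):
            p = 1.0
            for c, up_flag in zip(controllers, states):
                p *= availability[c] if up_flag else 1.0 - availability[c]
            up = {c for c, up_flag in zip(controllers, states) if up_flag}
            if all(cs & up for cs in placement.values()):
                total += p
        return total

    def plan_replication(placement, availability, goal):
        """Greedily replicate volumes until the availability goal is met:
        pick a volume whose replica group has the weakest availability and
        add the candidate replica location that helps the dataset most."""
        current = dataset_availability(placement, availability)
        while current < goal:
            vol = min(placement,
                      key=lambda v: max(availability[c] for c in placement[v]))
            candidates = set(availability) - placement[vol]
            if not candidates:
                raise RuntimeError("no candidate controllers left")
            scored = [(dataset_availability({**placement,
                                             vol: placement[vol] | {c}},
                                            availability), c)
                      for c in sorted(candidates)]
            best_avail, best = max(scored)
            if best_avail <= current:
                raise RuntimeError("goal unreachable by further replication")
            placement = {**placement, vol: placement[vol] | {best}}
            current = best_avail
        return placement

    # Example 1-style usage with assumed availability figures:
    avail = {"A": 0.9999, "B": 0.995, "C": 0.996, "D": 0.998, "E": 0.998}
    volumes = {"v1": {"A"}, "v2": {"B"}, "v3": {"C"}}
    print(plan_replication(volumes, avail, goal=0.999))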

In general, given a storage volume A, the choice of the controller that can host a replica of A (henceforth the replica is referred to as A′) is made using the following criteria. First, to minimize cost, it should be a storage controller with similar or lower availability if possible (i.e., it need not be a controller offering a much higher quality of service). Note that controller availability is estimated using the combination of deterministic and probabilistic methods described earlier. Second, simple deterministic considerations dictate that the scheduled outages of the two controllers hosting A and A′ should not be overlapping at any point in time (see the timelines in FIGS. 4 and 5 of Example 2). In addition, there must be sufficient time distance between any outages for volumes A and A′ to make their re-synchronization possible after a failure. The time necessary to synchronize two storage volume replicas is estimated taking into account the replication technology, the amount of data that needs to be transferred, and the data transfer speed.
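These deterministic checks might be sketched as follows (the interval representation, the assumption that each controller's own outages do not overlap one another, and the linear re-synchronization model are illustrative assumptions):

    def resync_seconds(volume_gb: float, transfer_mb_per_s: float) -> float:
        """Estimated replica re-synchronization time from the amount of data
        to transfer and the transfer speed; replication-technology
        overheads are ignored in this sketch."""
        return volume_gb * 1024.0 / transfer_mb_per_s

    def feasible_replica_host(outages_a, outages_b, resync_time):
        """Check that no scheduled outage of the controller hosting A
        overlaps an outage of the controller hosting A', and that
        consecutive outages on the two controllers leave at least
        `resync_time` between them. Outages are (start, end) pairs in
        consistent time units."""
        events = sorted([(s, e, "hostA") for s, e in outages_a] +
                        [(s, e, "hostA'") for s, e in outages_b])
        for (s1, e1, c1), (s2, e2, c2) in zip(events, events[1:]):
            if c1 != c2 and s2 < e1 + resync_time:
                return False   # overlap, or too little time to re-sync
        return True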

Given a set of storage controllers that could be potential candidates to host replica A′, this embodiment examines each candidate controller in some order (e.g., in random order) and determines whether the availability of the resulting dataset (calculated using the combined prediction of the deterministic and probabilistic methods) achieves the desired target. Besides the use of probabilistic formulas as demonstrated in Example 1 below, a potentially more accurate way to estimate overall availability is the use of simulation-based Decision Analysis, which was used in the aforementioned Crespo reference for the design of archival repositories. Such an analysis would be based on event-based simulations using the probabilistic estimates of storage controller availability (sources (a)-(c) described earlier). A drawback of this method is that it may not be suitable in an online scenario where near-immediate response is needed. In those cases, straightforward use of the probabilistic formulas (as described in Example 1 below) may be more appropriate.
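As a simple stand-in for such event-based simulation, a Monte Carlo estimate under the independence assumption of Example 1 might look like the following (purely illustrative; names are assumptions):

    import random

    def simulated_availability(placement, availability, trials=100_000, seed=0):
        """Monte Carlo estimate of dataset availability: in each trial,
        sample an up/down state for every controller from its availability
        estimate, and count the trials in which every volume keeps at
        least one replica on an up controller."""
        rng = random.Random(seed)
        hits = 0
        for _ in range(trials):
            up = {c for c, mu in availability.items() if rng.random() < mu}
            if all(cs & up for cs in placement.values()):
                hits += 1
        return hits / trials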

The process that was just described can be repeated for all storage volumes in a dataset VG to eventually produce the set VG′ and the associated mappings between storage volumes and storage controllers. In cases where no singly-replicated solution (i.e., where no volume can have more than one replica) exists that achieves the availability goal of a dataset, an alternative option is to attempt a solution where some of the volumes are doubly (or higher) replicated, on three (or more) controllers.

In general, one of the volumes in a set of replicas is designated as the primary; this is the replica that is to be the first active representative of the set. This is usually determined to be the storage volume on the controller with the latest scheduled downtime.

After the initial placement of volumes from VG and VG′ to storage controllers, the exemplary method periodically checks whether the availability goals of the data set are maintained in light of the most recent and up-to-date availability characterizations of storage controllers and RFCs submitted by operators (refer back to FIG. 1 for one example). When the availability level of a storage controller changes (e.g., it either goes into the “watch-list” due to some configuration change or one or more RFCs on it are submitted), the availability level of one or more datasets (those that have one or more storage volume replicas stored on that controller) is expected to change and thus new storage volume replicas will most likely have to be created.

One particularly interesting case in practice is that of “draining” (i.e., removing all volumes from) a storage controller. In this case, all storage volumes from that storage controller must be moved to other controllers, which may further require the creation and placement of replicas. This case can be treated using the general process described earlier. Note, however, that the migration of storage volumes between controllers involves data movement, which can be a slow process for large volumes.

When an administrator wants to introduce a new controller outage with an RFC(t, Δt), an alternative time t may be proposed by RAIC if that will result in significantly lower system impact (e.g., fewer replica creations or less data movement). If the operator/administrator insists on the original RFC specification, new replicas will proactively be built to guard against data unavailability at the expense of disk space dedicated to redundant data.

When a controller returns into service after being inaccessible, its storage volumes must typically be re-synchronized with any replica(s) they may have on other controllers. Note that continuous replication and re-synchronization can be performed in the background and do not directly affect availability.

By way of review, the method as described above can be visualized in connection with the flow chart of FIG. 1. The proposed embodiment maintains information about availability characteristics of storage controllers (as described earlier) and listens for (a) requests for change RFC(t, Δt) in the status of a storage controller, where an RFC may or may not specify the time of the change t but should specify an estimate of the duration of the change Δt; and (b) other events that may signal changes in the availability characteristics of storage controllers; examples include hardware or software controller failures, operator errors, or revised beliefs about a controller's ability to sustain failures.

For each submitted controller RFC or updated information about system status, the method checks whether any availability goals are violated. If so, volume(s) may be replicated as necessary. Besides purely availability-based estimates, the replication plan may also reflect business rules based on policy, e.g., use higher-quality controllers for important data.

A typical availability management plan includes three phases: PREPARE, EXECUTE, and COMPLETE. These phases handle the mechanics of implementing the solution proposed by the techniques described earlier and are specific to the replication technologies used in the particular deployed infrastructure.

The PREPARE phase is relevant prior to a controller failure or shutdown and involves creating and synchronizing replicas and setting up the necessary replication relationships.

The EXECUTE phase becomes relevant at the time that a controller fails or shuts down and involves handling the failover action to secondary replicas. The objective of this phase is to re-establish the availability of a dataset by masking volume failures and by redirecting data access to surviving storage volumes.

The COMPLETE phase becomes relevant at the time a storage controller that was previously taken out of service recovers and re-enters service. This phase involves resuming (or discarding, if deemed to be stale) replication relationships, re-synchronizing storage volumes, and optionally “failing back” data access into the recovered storage volumes.
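One possible, purely illustrative representation of such a three-phase plan as a data structure (the field names and the sample actions are assumptions, not the invention's data model):

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class AvailabilityPlan:
        """Container for the three-phase availability management plan."""
        prepare: List[str] = field(default_factory=list)   # create and sync replicas
        execute: List[str] = field(default_factory=list)   # failover to secondaries
        complete: List[str] = field(default_factory=list)  # resync and fail back

    plan = AvailabilityPlan(
        prepare=["create replica A' on SC2", "synchronize A -> A'"],
        execute=["redirect host access from A to A'"],
        complete=["re-synchronize A from A'", "fail back access to A"],
    )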

Following are examples of both probabilistic and deterministic availability calculations. These are purely illustrative in nature, to aid the skilled artisan in making and using one or more inventive embodiments, and are not to be taken as limiting.

EXAMPLE 1

With reference to FIGS. 2 and 3, consider five storage controllers A, B, C, D, E with availability μ_(A), μ_(B), μ_(C), μ_(D), μ_(E), respectively. In this example, we allocate a dataset of size x GB with performance y IO/s and overall availability μ using the volume allocation and placement procedure described earlier. In the first (initial allocation) phase, seven volumes are allocated based on capacity and performance considerations on three controllers (A, B, C). Following initial allocation, in the second phase of the allocation algorithm, volumes from controllers B and C (presumably the controllers with the lowest availabilities μ_(B) and μ_(C)) are selected to be replicated to an equal number of volumes on controllers D and E.

The estimate of the overall probability is based on the following theorem from probability theory, which states that for any two events A and B, the probability that either A or B or both occur is given by:

Pr{A or B} = Pr{A} + Pr{B} − Pr{A and B}  (5)

Assuming A and B are independent events:

Pr{A or B} = Pr{A} + Pr{B} − Pr{A} × Pr{B}.  (6)

Assuming that storage controllers fail independently and that a pair of controllers is unavailable if both controllers are unavailable:

Availability = Pr{A and Pair B,E and Pair C,D are available}
= 1 − Pr{A or Pair B,E or Pair C,D are unavailable}
= 1 − Pr{A unavailable} − Pr{Pair B,E or Pair C,D are unavailable} + Pr{A unavailable} × Pr{Pair B,E or Pair C,D are unavailable}
= 1 − (1 − μ_(A)) − [(1 − μ_(B))(1 − μ_(E)) + (1 − μ_(C))(1 − μ_(D)) − (1 − μ_(B))(1 − μ_(C))(1 − μ_(D))(1 − μ_(E))] + (1 − μ_(A)) × [(1 − μ_(B))(1 − μ_(E)) + (1 − μ_(C))(1 − μ_(D)) − (1 − μ_(B))(1 − μ_(C))(1 − μ_(D))(1 − μ_(E))]  (7)

The above formula (or a similarly derived formula adapted to a given configuration of replication relationships and number and type of controllers) can be used to determine the availability of the data set.
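For instance, with assumed availabilities μ_(A)=0.9999, μ_(B)=0.995, μ_(C)=0.996, μ_(D)=0.998, and μ_(E)=0.998 (values chosen purely for illustration), equation (7) can be evaluated, and cross-checked against the equivalent direct product form, as follows:

    mu = {"A": 0.9999, "B": 0.995, "C": 0.996, "D": 0.998, "E": 0.998}
    q = {k: 1.0 - v for k, v in mu.items()}   # per-controller unavailability

    # Pr{Pair B,E or Pair C,D unavailable}, by inclusion-exclusion:
    pairs = q["B"] * q["E"] + q["C"] * q["D"] - q["B"] * q["C"] * q["D"] * q["E"]

    # Equation (7):
    availability = 1.0 - q["A"] - pairs + q["A"] * pairs

    # Equivalent direct product form, as a cross-check:
    direct = mu["A"] * (1.0 - q["B"] * q["E"]) * (1.0 - q["C"] * q["D"])
    print(availability, direct)   # both ≈ 0.99988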

By way of review, in FIG. 2, a data set can be spread over five storage controllers (labeled A-E). Each controller is characterized by availability μ_(A)-μ_(E). The data set comprises 7 storage volumes initially spread over three storage controllers (A, B, and C); this initial allocation is decided based on capacity (x GB) and performance (e.g., y IOs/sec) goals, using known techniques such as VPA. The availability goal (μ) of the data set can be satisfied as described herein. The resulting placement satisfying the availability goal (μ) is shown in FIG. 3.

EXAMPLE 2

Referring to FIGS. 4 and 5, consider three datasets with different availability goals (0.9999, 0.999, and 0.9 from top to bottom). Consider also seven controllers (SC1-SC7), each with a different availability outlook expressed in their timeline of known, expected outages. The timelines in FIG. 4 describe the known, expected outages for each controller. For the first six controllers we use mostly deterministic information. The last controller (SC7), for which there is no known outage, is considered suspect due to a recent firmware upgrade. Its probabilistic availability estimate is therefore low.

As in the previous example, in the first (initial allocation) phase of the algorithms, volumes for each dataset are assigned to storage controllers based on capacity and performance goals. For the first dataset, a single volume (A) is allocated on SC1. For the second dataset, three volumes (B, C, and D) are allocated on SC3-SC5. Finally, for the third dataset, a single volume (E) is allocated on SC7.

In this example, an additional effort is made in the initial phase to try to perform the initial allocation on a storage controller whose availability is as close as possible to the overall dataset availability goal. For example, volume A is assigned to the controller with the highest availability (SC1), since that controller most closely approximates (but falls short of) the dataset's availability goal.

Once the initial allocations are complete, in the second phase of the algorithm we turn our attention to using volume replication to satisfy the availability goals. Observing that the availability goal of the first volume group is quite ambitious (four 9's over one month implies an outage of only about 4-5 minutes over the same time interval), storage volume A must be replicated on another highly-available controller. The technique thus selects storage controller SC2 for hosting A′, the replica of A. Similarly, for the second data set, the algorithm chooses storage controllers SC4, SC3, and SC6 to progressively improve the availability of that data set by replicating volumes B, C, and D (to B′, C′, and D′), respectively. Finally, the algorithm selects SC5 to host E′, the replica of volume E on SC7, and reach the availability goal of the third dataset.

By way of review, in FIG. 4, there are three volume groups with different availability goals (μ=0.9999 for the VG that includes storage volume A, μ=0.999 for the VG that comprises B, C, and D, and μ=0.9 for the VG that includes volume E). These volumes must be placed on any subset of 7 controllers, listed in order of availability. Each timeline describes the outages for each controller. We first assign volumes to storage controllers based on capacity and performance goals, while trying to get as close as we can to the availability goal. Four 9's over one month implies an outage of only about 4-5 minutes. The controllers with the highest operational availability cannot provide this kind of service, so a storage volume must be replicated on two such controllers. In choosing a controller for the replica, to minimize cost, it should be a controller with similar availability. The outages should not be overlapping and there should be sufficient distance between outages for volumes to be re-synchronized. If a volume falls out of sync and it is scheduled to become primary next, it should be synchronized. In FIG. 5, primed volumes (e.g., A′) designate secondary replicas. The primary volume should be on the controller that will have its first failure later than the other; RAIC should ensure that there is enough time between the time that A comes back in service and the time A′ disappears so that A can go back in sync. Typically, the only outage time that affects operational availability is the failover time from A′ to A. Everything else can typically be done in the background and thus does not affect availability.

A variety of techniques, utilizing dedicated hardware, general purpose processors, firmware, software, or a combination of the foregoing, may be employed to implement the present invention. One or more embodiments of the invention can be implemented in the form of a computer product including a computer usable medium with computer usable program code for performing the method steps indicated. Furthermore, one or more embodiments of the invention can be implemented in the form of an apparatus including a memory and at least one processor that is coupled to the memory and operative to perform exemplary method steps.

At present, it is believed that one or more embodiments will make substantial use of software running on a general purpose computer or workstation. With reference to FIG. 6, such an implementation might employ, for example, a processor 602, a memory 604, and an input/output interface formed, for example, by a display 606 and a keyboard 608. The term “processor” as used herein is intended to include any processing device, such as, for example, one that includes a CPU (central processing unit) and/or other forms of processing circuitry. Further, the term “processor” may refer to more than one individual processor. The term “memory” is intended to include memory associated with a processor or CPU, such as, for example, RAM (random access memory), ROM (read only memory), a fixed memory device (e.g., hard drive), a removable memory device (e.g., diskette), a flash memory and the like. In addition, the phrase “input/output interface” as used herein is intended to include, for example, one or more mechanisms for inputting data to the processing unit (e.g., mouse), and one or more mechanisms for providing results associated with the processing unit (e.g., printer). The processor 602, memory 604, and input/output interface such as display 606 and keyboard 608 can be interconnected, for example, via bus 610 as part of a data processing unit 612. Suitable interconnections, for example via bus 610, can also be provided to a network interface 614, such as a network card, which can be provided to interface with a computer network, and to a media interface 616, such as a diskette or CD-ROM drive, which can be provided to interface with media 618.

Accordingly, computer software including instructions or code for performing the methodologies of the invention, as described herein, may be stored in one or more of the associated memory devices (e.g., ROM, fixed or removable memory) and, when ready to be utilized, loaded in part or in whole (e.g., into RAM) and executed by a CPU. Such software could include, but is not limited to, firmware, resident software, microcode, and the like.

Furthermore, the invention can take the form of a computer program product accessible from a computer-usable or computer-readable medium (e.g., media 618) providing program code for use by or in connection with a computer or any instruction execution system. For the purposes of this description, a computer usable or computer readable medium can be any apparatus for use by or in connection with the instruction execution system, apparatus, or device.

The medium can be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium. Examples of a computer-readable medium include a semiconductor or solid-state memory (e.g., memory 604), magnetic tape, a removable computer diskette (e.g., media 618), a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk and an optical disk. Current examples of optical disks include compact disk-read only memory (CD-ROM), compact disk-read/write (CD-R/W) and DVD.

A data processing system suitable for storing and/or executing program code will include at least one processor 602 coupled directly or indirectly to memory elements 604 through a system bus 610. The memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during execution.

Input/output or I/O devices (including but not limited to keyboards 608, displays 606, pointing devices, and the like) can be coupled to the system either directly (such as via bus 610) or through intervening I/O controllers (omitted for clarity).

Network adapters such as network interface 614 may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Modems, cable modems, and Ethernet cards are just a few of the currently available types of network adapters.

In any case, it should be understood that the components illustrated herein may be implemented in various forms of hardware, software, or combinations thereof, e.g., application specific integrated circuits (ASICs), functional circuitry, one or more appropriately programmed general purpose digital computers with associated memory, and the like. Given the teachings of the invention provided herein, one of ordinary skill in the related art will be able to contemplate other implementations of the components of the invention.

Although illustrative embodiments of the present invention have been described herein with reference to the accompanying drawings, it is to be understood that the invention is not limited to those precise embodiments, and that various other changes and modifications may be made by one skilled in the art without departing from the scope or spirit of the invention.

1. A method for controlling a computer storage system, comprising the steps of: obtaining deterministic component availability information pertaining to said system; obtaining probabilistic component availability information pertaining to said system; and checking for violation of availability goals based on both said deterministic component availability information and said probabilistic component availability information.
2. The method of claim 1, further comprising the additional step of maintaining a current status, responsive to said checking indicating no violation of said availability goals.
3. The method of claim 1, further comprising the additional step of determining replication parameters, responsive to said checking indicating a violation of said availability goals.
4. The method of claim 3, wherein said replication parameters comprise at least how to replicate and where to replicate.
5. The method of claim 3, wherein said obtaining deterministic component availability information pertaining to said system comprises obtaining a request for change, further comprising the additional step of obtaining an estimated replication time.
6. The method of claim 5, further comprising the additional step of taking said estimated replication time into account in evaluating said request for change.
7. The method of claim 6, wherein said taking into account comprises: determining whether sufficient time is available to replicate to accommodate said request for change; and responsive to said determining step indicating that sufficient time is not available, rejecting said request for change.
8. The method of claim 6, wherein said taking into account comprises: determining whether sufficient time is available to replicate to accommodate said request for change; and responsive to said determining step indicating that sufficient time is not available, searching for an alternative plan to accommodate said request for change.
9. The method of claim 6, wherein said taking into account comprises: determining whether sufficient time is available to replicate to accommodate said request for change; and responsive to said determining step indicating that sufficient time is available, developing a change plan.
10. The method of claim 9, wherein said change plan comprises: preparation information indicative of replica locations, relationships, and timing.
11. The method of claim 10, wherein said change plan further comprises: execution information indicative of replication performance; and failover detection information indicative of how to execute necessary failover actions no later than a time of an action associated with said request for change.
12. The method of claim 11, wherein said change plan further comprises: completion information indicative of replication relationship maintenance and discard, and information indicative of how to execute necessary failback actions no earlier than a time of another action associated with said request for change.
13. A method for controlling a computer storage system, comprising the steps of: obtaining a request for change; obtaining an estimated replication time associated with a replication to accommodate said change; and taking said estimated replication time into account in evaluating said request for change.
14. The method of claim 13, wherein said taking into account comprises: determining whether sufficient time is available to replicate to accommodate said request for change; and responsive to said determining step indicating that sufficient time is not available, rejecting said request for change.
15. The method of claim 13, wherein said taking into account comprises: determining whether sufficient time is available to replicate to accommodate said request for change; and responsive to said determining step indicating that sufficient time is not available, searching for an alternative plan to accommodate said request for change.
16. The method of claim 13, wherein said taking into account comprises: determining whether sufficient time is available to replicate to accommodate said request for change; and responsive to said determining step indicating that sufficient time is available, developing a change plan.
17. A computer program product comprising a computer useable medium having computer useable program code for controlling a computer storage system, said computer program product including: computer useable program code for obtaining deterministic component availability information pertaining to said system; computer useable program code for obtaining probabilistic component availability information pertaining to said system; and computer useable program code for checking for violation of availability goals based on both said deterministic component availability information and said probabilistic component availability information.
18. The computer program product of claim 17, wherein said computer useable program code for obtaining deterministic component availability information pertaining to said system comprises computer useable program code for obtaining a request for change, further comprising: computer useable program code for obtaining an estimated replication time; and computer useable program code for taking said estimated replication time into account in evaluating said request for change.
19. A computer program product comprising a computer useable medium having computer useable program code for controlling a computer storage system, said computer program product including: computer useable program code for obtaining a request for change; computer useable program code for obtaining an estimated replication time associated with a replication to accommodate said change; and computer useable program code for taking said estimated replication time into account in evaluating said request for change.
20. The computer program product of claim 19, wherein said computer useable program code for taking said estimated replication time into account comprises: computer useable program code for determining whether sufficient time is available to replicate to accommodate said request for change; and computer useable program code for rejecting said request for change, responsive to said computer useable program code for determining indicating that sufficient time is not available.